While it is possible to fully compile this RMarkdown document, during the workshop it is advisable to run the lines of code individually and make changes to the analysis parameters to help in learning and exploring.
You will need to install:
Latest development version of StabMap package from Github.
If using anndata object from Parts I-II, zellkonverter package from Bioconductor.
If using reducedDimGirafe function, then the ggiraph package from CRAN.
library(StabMap)
library(SingleCellExperiment)
library(BiocNeighbors)
library(scran)
library(scuttle)
library(batchelor)
library(bluster)
library(scater)
library(zellkonverter)
library(S4Vectors)
library(plotly)
library(ggplot2)
library(patchwork)
library(ggiraph)
library(RColorBrewer)
library(grDevices)
In addition, we read in the helpful functions for this workshop. Note
the function plotReducedDimGirafe requires the package
ggiraph.
source("workshop_functions.R")
We are using the data from 10x Genomics’ Breast Cancer Xenium
platform and scFFPE dissociated single cell profile from the matching
sample. These data objects have been prepared from Part I of the
workshop and can be read in as SingleCellExperiment objects
using the readH5AD() function in the
zellkonverter package.
We assume the data files are in the data folder above
this repository folder. You may need to edit the lines below to access
the correct files.
The sce object corresponds to the single-cell
(dissociated) data, while spe object corresponds to the
spatial omics data. We remove the cells in the spe object
that are Unlabeled.
sce = readH5AD("../data/scFFPE_raw.h5ad", X_name = "counts")
spe = readH5AD("../data/xenium_rep1.h5ad", X_name = "counts")
spe <- spe[,!spe$Cluster %in% c("Unlabeled")]
To enable faster processing during this workshop, cells are selected
randomly from the sce and spatialexpression
spe data objects.
set.seed(2024)
spe <- spe[,sample(colnames(spe), 10000)]
sce <- sce[,sample(colnames(sce), 5000)]
This segment normalizes and transforms the gene expression data, preparing it for downstream analysis. It includes steps like logarithmic normalization and highly variable gene selection. We select the 1,000 most highly variable genes.
spe <- logNormCounts(spe)
sce <- logNormCounts(sce)
stats <- modelGeneVar(sce, assay.type = "logcounts")
hvgs = getTopHVGs(stats, n = 1000)
sce <- sce[hvgs,]
The below code computes the mean expression of spatially neighboring
cells in the spe data. It enriches the dataset with spatial
context, aiding in spatial analysis and interpretation.
The custom function getNeighbourMean is used to
calculate, for each cell, the mean logcount expression of the nearest 5
cells, and setting includeSelf = FALSE indicates that the
expression of the cell is not included in the mean calculation.
We will include the logcounts_neighbours assay to the
existing spe object so that we can use the
plotReducedDim function to easily visualise both sets of
features side-by-side, in this case for the ERBB2 gene.
We are also going to concatenate these newly
computed neighbourhood features to the per-cell logcounts of our
spe object. We will retain this data matrix as the
concatenated_logcounts object.
logcounts_neighbours = getNeighbourMean(spe,
assayName = "logcounts",
spatialReducedDim = "spatial",
kval = 5,
includeSelf = FALSE)
assay(spe, "logcounts_neighbours", withDimnames = FALSE) <- logcounts_neighbours
concatenated_logcounts = rbind(assay(spe, "logcounts"), logcounts_neighbours)
plotReducedDim(spe, "spatial", colour_by = "ERBB2", by.assay.type = "logcounts") +
coord_fixed() +
ggtitle("logcounts") +
plotReducedDim(spe, "spatial", colour_by = "ERBB2", by.assay.type = "logcounts_neighbours") +
coord_fixed() +
ggtitle("logcounts_neighbours")
First we like to check that there are no duplicate column names
between the sce and spe objects. Since the
data will be integrated, the data need to have unique column names for
correct mapping along the mosaic data topology. The value below should
be 0.
length(intersect(colnames(sce), colnames(spe)))
## [1] 0
Below are four chunks that give different settings of the mosaic
integration. Note that the compiled version of this report will only run
Setting 1, you can edit this by changing the code chunk parameter
eval=FALSE to eval=TRUE and vice versa.
The most relevant considerations for mosaic integration are:
Which datasets are being integrated?
Which features are being included in mosaic integration?
Which datasets are to be considered as reference datasets?
Whether data needs to be centred and scaled prior to mosaic integration.
For the remainder of this workshop, we will select a few different mosaic integration settings and examine the quality and characteristics of the integration results.
Here we use only the per-cell logcount expression, and we treat both datasets as reference datasets. Later on, we will need to reweight the embedding to assign equal weighting to the reference datasets.
stab = stabMap(assay_list = list(
"sce" = assay(sce, "logcounts"),
"spe" = assay(spe, "logcounts")),
reference_list = c("sce", "spe"),
plot = FALSE)
Here we use only the per-cell logcount expression, and we treat only
the sce dataset as the reference dataset. This means that
the spe dataset has no influence on the StabMap embedding ,
and is only projected onto the embedding.
stab = stabMap(assay_list = list(
"sce" = assay(sce, "logcounts"),
"spe" = assay(spe, "logcounts")),
reference_list = c("sce"),
plot = FALSE)
This setting is similar to Setting 2 but we choose spe
as the reference dataset.
stab = stabMap(assay_list = list(
"sce" = assay(sce, "logcounts"),
"spe" = assay(spe, "logcounts")),
reference_list = c("spe"),
plot = FALSE)
In this setting, we use the concatenated_logcounts data
for our spatial integration, and use both datasets as reference
datasets. This means that the neighbourhood features that we calculated
from spe are being used in estimating our StabMap
embedding.
stab = stabMap(assay_list = list(
"sce" = assay(sce, "logcounts"),
"spe" = concatenated_logcounts),
reference_list = c("spe", "sce"),
plot = FALSE)
Can you think of additional Settings? Add your own chunks to explore.
Now that we have selected a Setting for our StabMap mosaic data integration, we may need to reweight contributions from reference datasets.
In any case where we select more than one reference dataset, it may be the case that one dataset has more variation present than another. We need to consider reweighting the contributions of the multiple reference datasets so that we do not inadvertently give too large a priority for the variation present in one reference dataset than another. This becomes important when we use methods that calculate euclidean or similar distances, e.g. for clustering.
In practise, this is done by forcing the same total L1 norm for the embedding dimensions stemming from each reference dataset.
We also have freedom to assign custom weights to each dataset. In this way, we can moderate our prior belief of the quality or comprehensiveness of each reference dataset.
Here we use the default to reweight equally.
stab_reweighted = reWeightEmbedding(stab)
stab_reweighted = reWeightEmbedding(stab,
weights = c("sce_PC" = 0.8, "spe_PC" = 0.2))
We can visualise the difference by plotting the L1 norms for each embedding dimension before (black open) and after (blue solid) re-weighting.
plot(colSums(abs(stab)),
xlab = "Embedding dimension number",
ylab = "Total L1 norm",
ylim = c(0, max(colSums(abs(cbind(stab, stab_reweighted))))))
points(colSums(abs(stab_reweighted)), col = "blue", pch = 16)
In this part we will explore downstream analysis after StabMap mosaic data integration.
Now that we have performed mosaic integration using StabMap, it is
very likely there are still be technical effects between the cells
coming from the sce and spe datasets. Because
StabMap gives us a common low-dimensional embedding, we can implement
any horizontal integration method that takes the
low-dimensional embeddings as input. Examples of such methods include
Harmony, scVI, and Mutual Nearest Neighbours (MNN).
In this case, we are going to use the MNN method implemented in the
reducedMNN function. We do this by providing the reweighted
StabMap embedding as two distinct objects for the sce and
spe derived cells.
mnn_out = reducedMNN(stab_reweighted[colnames(sce),],
stab_reweighted[colnames(spe),])
stabmap_reweighted_corrected = mnn_out[["corrected"]]
We can use the imputeEmbedding function to impute
feature values for a given set of query cells. The function requires the
original data for calculating the imputed values, the stabMap embedding
to identify the nearest cells, and for the reference and query cell
names to be given.
In any of the Imputation cases, we create a
joint_assay_XXX data matrix that we can use for further
visualisation or differential expression testing.
Note that we are free to perform multiple sets of imputation. In general it is a good idea to match the mosaic data integration Setting with the specific Imputation choice, e.g. Setting 4 uses the neighbourhood expression features and Imputation 2 imputes the neighbourhood expression of cells, but this is not actually required.
Note that each imputation will take a few minutes to run, we will run both Imputation settings.
In this Imputation we use the corrected StabMap embedding to impute
gene expression of the spatial cells. Then, we combine the imputed and
observed logcounts. Because we are going to combine Imputation 1 and 2,
we add “_sceRef” to the feature names that are imputed using the
sce cells.
imp_sceRef = imputeEmbedding(assay_list = list(sce = assay(sce, "logcounts"),
spe = assay(spe, "logcounts")),
embedding = stabmap_reweighted_corrected,
reference = colnames(sce),
query = colnames(spe))
joint_assay_sceRef = cbind(assay(sce, "logcounts"), imp_sceRef[["sce"]])
rownames(joint_assay_sceRef) <- paste0(rownames(joint_assay_sceRef),"_sceRef")
In this Imputation we use impute the per-cell and neighbourhood
expression, treating the spe cells as reference and the
sce cells as query. As above, we combine the imputed and
observed features. Similar to above, because we are going to combine
Imputation 1 and 2, we add “_seeRef” to the feature names that are
imputed using the sce cells.
imp_speRef = imputeEmbedding(assay_list = list(sce = assay(sce, "logcounts"),
spe = concatenated_logcounts),
embedding = stabmap_reweighted_corrected,
reference = colnames(spe),
query = colnames(sce))
joint_assay_speRef = cbind(imp_speRef[["spe"]], concatenated_logcounts)
rownames(joint_assay_speRef) <- paste0(rownames(joint_assay_speRef),"_speRef")
We can use the SingleCellExperiment object class to
combine our information from each data source. This will give us a
convenient object to perform visualisation and other downstream tasks
like clustering or differential expression.
However, SingleCellExperiment requires that assays
within a single object have the same features. Between the two
Imputations this is not guaranteed, so we will concatenate these
features.
We can create the column metadata using the combineRows
function, and add a column indicating which dataset it belongs to.
joint_cData = combineRows(colData(sce), colData(spe))
joint_cData$dataset <- ifelse(rownames(joint_cData) %in% colnames(sce), "sce", "spe")
joint_assay = rbind(joint_assay_sceRef[,rownames(joint_cData)],
joint_assay_speRef[,rownames(joint_cData)])
Now we take these components and create the SingleCellExperiment object. To avoid overplotting when visualising later, we will randomly shuffle the cell order in the object.
jointSCE = SingleCellExperiment(
assays = list(imputed = joint_assay),
reducedDims = list(
StabMap = stabmap_reweighted_corrected[rownames(joint_cData),],
spatial = cbind(joint_cData$x_centroid, joint_cData$y_centroid)
),
colData = joint_cData
)
jointSCE <- jointSCE[,sample(colnames(jointSCE))]
jointSCE
## class: SingleCellExperiment
## dim: 1626 15000
## metadata(0):
## assays(1): imputed
## rownames(1626): COL3A1_sceRef CD74_sceRef ... ZEB2_neighbours_speRef
## ZNF562_neighbours_speRef
## rowData names(0):
## colnames(15000): 138648 CAAGGCCAGGTTCGAG-1 ... 59633 126840
## colData names(13): Cluster sizeFactor ... replicate dataset
## reducedDimNames(2): StabMap spatial
## mainExpName: NULL
## altExpNames(0):
In preparation for the next part, we define a colour scheme for the cell types and clusters.
First set a colour scheme for the clusters.
clusterCount = length(unique(jointSCE$Cluster))
clusterPalette = colorRampPalette(brewer.pal(9, "Set1"))(clusterCount)
We can take the StabMap low dimensional embedding and further reduce
this to just two dimensions using UMAP. Then we can visualise these
using the plotUMAP function from the scater
package. The function generates a ggplot object, which can
be further customised using the + operator.
g1 visualises the cells coloured by dataset, while
g2 visualises the cells according to their annotated cell
type.
jointSCE <- runUMAP(jointSCE, dimred = "StabMap")
g1 = plotUMAP(jointSCE, colour_by = "dataset", point_size = 0.5) +
scale_colour_brewer(palette = "Set1") +
theme(legend.position = "bottom") +
guides(colour = guide_legend(title = "", override.aes = list(size = 3)))
g1
g2 = plotUMAP(jointSCE, colour_by = "Cluster", point_size = 0.5) +
scale_colour_manual(values = clusterPalette) +
theme(legend.position = "bottom") +
guides(colour = guide_legend(title = "", override.aes = list(size = 3)))
g2
We can note that the ‘stromal’ label is transcriptionally diverse.
We can use the StabMap embedding to perform joint clustering of the
sce and spe cells. We use the function
clusterRows() from the bluster package to do
so. Note that setting the clustering parameter to
NNGraphParam() indicates that graph-based nearest neighbour
clustering will be used.
clus = clusterRows(reducedDim(jointSCE, "StabMap"), NNGraphParam())
jointSCE$joint_cluster = clus
We can visualise the new joint clusters using
plotUMAP().
jointclusterCount = length(unique(jointSCE$joint_cluster))
jointclusterPalette = colorRampPalette(brewer.pal(9, "Set1"))(jointclusterCount)
g3 = plotUMAP(jointSCE, colour_by = "joint_cluster", point_size = 0.5) +
scale_colour_manual(values = jointclusterPalette) +
theme(legend.position = "bottom") +
guides(colour = guide_legend(title = "", override.aes = list(size = 3)))
g3
In addition to visualising the UMAP embedding from StabMap, we can
use the plotReducedDim() function to plot the spatial
locations of the spe cells, coloured by the annotated cell
types and the new joint clusters.
g4 = plotReducedDim(jointSCE, "spatial",
colour_by = "Cluster", point_size = 0.5) +
scale_colour_manual(values = clusterPalette) + coord_fixed() +
theme(legend.position = "bottom") +
guides(colour = guide_legend(title = "", override.aes = list(size = 3)))
g4
g5 = plotReducedDim(jointSCE, "spatial",
colour_by = "joint_cluster", point_size = 0.5) +
scale_colour_manual(values = jointclusterPalette) + coord_fixed() +
theme(legend.position = "bottom") +
guides(colour = guide_legend(title = "", override.aes = list(size = 3)))
g5
We can also plot the spatial coordinates according to the imputed gene expression values. For example, the gene FCGR2A is not measured in the Xenium experiment, but we can use our StabMap embedding to obtain our best estimate.
gene = "FCGR2A"
g6 = plotReducedDim(jointSCE, "spatial",
colour_by = paste0(gene, "_sceRef"),
by_exprs_values = "imputed")
g6
Interestingly, we can also use the imputation of the spatial
neighbours to visualise our best guess of neighbourhood expression of
our sce cells. For example, we can take the gene CD14,
which is a marker for immune cells, and examine the imputed
neighbourhood expression. In this case, it’s helpful for us to facet our
visualisation by the dataset to check our results.
gene = "CXCR4"
g7 = plotReducedDim(jointSCE, "UMAP",
colour_by = paste0(gene, "_neighbours", "_speRef"),
by_exprs_values = "imputed",
other_fields = list("dataset")) +
facet_wrap(~dataset)
g7
Our visualisation of imputed neighbourhood expression suggests there is high neighbourhood expression among B cells, but also among a subset of stromal cells.
While not specifically about StabMap, when performing integration of
single cell and spatial data, it is very helpful to work with
interactive visualisation tools. One example is plotly
package, and in particular the ggplotly() function. We can
take any plot we generated above and visualise interactively, and hover
over the cells to examine further.
ggplotly(g2)
For data where we have access to muliple low-dimensional embeddings,
we can use the custom function plotReducedDimGirafe() that
is provided as part of this workshop. We can use the lasso tool to
identify which cells are present in different locations.
plotReducedDimGirafe(jointSCE)
This workshop aimed to guide through integration of single-cell RNA sequencing (scRNA-seq) data with spatial omics data using the StabMap package. By leveraging various settings for mosaic data integration and reweighting contributions from reference datasets, we have covered how to effectively combine information from dissociated single-cell profiles and spatial expression data.
Further resources:
Orchestrating single-cell analysis using Bioconductor https://bioconductor.org/books/release/OSCA/
StabMap package website https://marionilab.github.io/StabMap/
StabMap Paper https://www.nature.com/articles/s41587-023-01766-z
Contact shila.ghazanfar@sydney.edu.au
Thank you to the following for their careful feedback on this workshop:
sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.5.2
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Australia/Sydney
## tzcode source: internal
##
## attached base packages:
## [1] grDevices stats graphics utils stats4 methods base
##
## other attached packages:
## [1] Matrix_1.5-4.1 slam_0.1-50
## [3] RColorBrewer_1.1-3 ggiraph_0.8.7
## [5] patchwork_1.1.2 plotly_4.10.2
## [7] zellkonverter_1.11.1 scater_1.29.0
## [9] ggplot2_3.4.2 bluster_1.11.1
## [11] batchelor_1.17.0 scran_1.29.0
## [13] scuttle_1.11.0 BiocNeighbors_1.19.0
## [15] SingleCellExperiment_1.23.0 SummarizedExperiment_1.31.1
## [17] Biobase_2.61.0 GenomicRanges_1.53.1
## [19] GenomeInfoDb_1.37.2 IRanges_2.35.2
## [21] S4Vectors_0.39.1 BiocGenerics_0.47.0
## [23] MatrixGenerics_1.13.0 matrixStats_1.0.0
## [25] StabMap_0.1.8 igraph_1.5.0
##
## loaded via a namespace (and not attached):
## [1] rstudioapi_0.14 jsonlite_1.8.5
## [3] magrittr_2.0.3 ggbeeswarm_0.7.2
## [5] farver_2.1.1 rmarkdown_2.22
## [7] zlibbioc_1.47.0 vctrs_0.6.3
## [9] DelayedMatrixStats_1.23.0 RCurl_1.98-1.12
## [11] htmltools_0.5.5 S4Arrays_1.1.4
## [13] SparseArray_1.1.10 sass_0.4.6
## [15] bslib_0.5.0 htmlwidgets_1.6.2
## [17] basilisk_1.13.1 plyr_1.8.8
## [19] cachem_1.0.8 ResidualMatrix_1.11.0
## [21] uuid_1.1-0 lifecycle_1.0.3
## [23] pkgconfig_2.0.3 rsvd_1.0.5
## [25] R6_2.5.1 fastmap_1.1.1
## [27] GenomeInfoDbData_1.2.10 digest_0.6.31
## [29] colorspace_2.1-0 rprojroot_2.0.3
## [31] dqrng_0.3.0 irlba_2.3.5.1
## [33] crosstalk_1.2.0 beachmat_2.17.8
## [35] filelock_1.0.2 labeling_0.4.2
## [37] fansi_1.0.4 httr_1.4.6
## [39] abind_1.4-5 compiler_4.3.1
## [41] here_1.0.1 withr_2.5.0
## [43] BiocParallel_1.35.2 viridis_0.6.3
## [45] UpSetR_1.4.0 highr_0.10
## [47] MASS_7.3-60 DelayedArray_0.27.5
## [49] tools_4.3.1 vipor_0.4.5
## [51] beeswarm_0.4.0 glue_1.6.2
## [53] grid_4.3.1 cluster_2.1.4
## [55] generics_0.1.3 gtable_0.3.3
## [57] tidyr_1.3.0 data.table_1.14.8
## [59] BiocSingular_1.17.0 ScaledMatrix_1.9.1
## [61] metapod_1.9.0 utf8_1.2.3
## [63] XVector_0.41.1 RcppAnnoy_0.0.20
## [65] ggrepel_0.9.3 pillar_1.9.0
## [67] limma_3.57.6 dplyr_1.1.2
## [69] lattice_0.21-8 tidyselect_1.2.0
## [71] locfit_1.5-9.8 knitr_1.43
## [73] gridExtra_2.3 edgeR_3.43.7
## [75] xfun_0.39 datasets_4.3.1
## [77] statmod_1.5.0 lazyeval_0.2.2
## [79] yaml_2.3.7 evaluate_0.21
## [81] codetools_0.2-19 tibble_3.2.1
## [83] cli_3.6.1 uwot_0.1.14
## [85] reticulate_1.30 systemfonts_1.0.4
## [87] munsell_0.5.0 jquerylib_0.1.4
## [89] Rcpp_1.0.10 dir.expiry_1.9.0
## [91] png_0.1-8 parallel_4.3.1
## [93] ellipsis_0.3.2 basilisk.utils_1.13.1
## [95] sparseMatrixStats_1.13.0 bitops_1.0-7
## [97] viridisLite_0.4.2 scales_1.2.1
## [99] purrr_1.0.1 crayon_1.5.2
## [101] rlang_1.1.1 cowplot_1.1.1